Univariate exploratory data analysis
GEOG 30323
September 22, 2015
The data analysis process
Adapted from H. Wickham
Exploratory data analysis
- “Detective work” to summarize and explore datasets
Includes:
- Data acquisition and input
- Data cleaning and wrangling (“tidying”)
- Data transformation and summarization
- Data visualization
Your core Python tools for EDA: NumPy, pandas, and seaborn/matplotlib
NumPy
- Extension to Python; the core Python package for numerical computing
- Standard import:
import numpy as np
- Data structure: the NumPy array. It resembles a list, but it has more methods and can be multidimensional
import numpy as np
y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])
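To illustrate what the array adds over a nested list, the sketch below rebuilds the same array and inspects its dimensions and a few whole-array computations. The names doubled, col_means, and row_sums are illustrative only.

```python
import numpy as np

# Same 4 x 6 array as in the slide
y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

print(y.shape)  # (number of rows, number of columns)
print(y.ndim)   # number of dimensions

# Arithmetic is applied element by element, with no explicit loop
doubled = y * 2

# Methods can summarize the whole array or work along one axis
col_means = y.mean(axis=0)  # one mean per column
row_sums = y.sum(axis=1)    # one sum per row
print(col_means)
print(row_sums)
```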
Pandas
- Built on top of NumPy; adds support for table-like data structures in Python
- Standard import:
import pandas as pd
- Sequences of data are stored as Series objects, which collectively form DataFrames
import pandas as pd
# y is the NumPy array created earlier; the columns are named x1 through x6
df = pd.DataFrame(y, columns=['x' + str(num) for num in range(1, 7)])
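The relationship between Series and DataFrames can be checked directly. The following is a minimal sketch that rebuilds the same data frame and pulls out one column; first_col is an illustrative name.

```python
import numpy as np
import pandas as pd

y = np.array([[2, 4, 6, 8, 10, 12],
              [1, 3, 5, 7, 9, 11],
              [10, 12, 14, 18, 22, 14],
              [9, 3, 3, 3, 3, 1]])

# Column names x1 through x6
df = pd.DataFrame(y, columns=['x' + str(num) for num in range(1, 7)])

# A single column of a DataFrame is a Series
first_col = df['x1']
print(type(first_col))
print(df.shape)  # (rows, columns)
```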
The pandas DataFrame
- Commonly, DataFrames are created by reading in external data, like CSV files
# To read in CSV files, we use the pd.read_csv function
grad = pd.read_csv('grad_rates.csv')
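If you want to try pd.read_csv without a file on disk, it also accepts a file-like object. The sketch below uses a small made-up table; the column names and values are hypothetical and are not the contents of grad_rates.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; the real grad_rates.csv
# may have different columns and values
csv_text = """school,rate
A,61.5
B,74.0
C,58.2
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows of the table
print(sample.dtypes)   # column types inferred by pandas
```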
The pandas DataFrame
- Each observation forms a row, defined by an index; attributes of those observations are found in the columns of the DataFrame

- Columns are accessible as indices, e.g. grad['rate'], or as attributes of the data frame, e.g. grad.rate
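The two access styles return the same Series. A short sketch with made-up data (the sample table and its values are hypothetical) also shows a case where only the bracket style works:

```python
import pandas as pd

# Hypothetical data standing in for the graduation-rate table
sample = pd.DataFrame({'school': ['A', 'B', 'C'],
                       'rate': [61.5, 74.0, 58.2]})

by_index = sample['rate']  # bracket (index) access
by_attr = sample.rate      # attribute access
print(by_index.equals(by_attr))

# Bracket access is required when a column name contains a space
# or matches an existing DataFrame attribute or method
sample['grad rate'] = sample['rate']
print(sample['grad rate'])
```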
Levels of measurement
- Nominal: qualitative, descriptive, categories
- Ordinal: ordering or ranking; however, no information about distance between ranks
- Interval: additive; no natural zero (zero is an arbitrary point on the scale, not an absence of the quantity)
- Ratio: multiplicative; natural zero (zero means an absence of a value)
Make sure you know your column types (dtypes) and levels of measurement before doing analysis!
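One way to check column types, and to record an ordinal ranking explicitly, is sketched below. The table, its column names, and the category order are all hypothetical.

```python
import pandas as pd

# Hypothetical table with one column per level of measurement
sample = pd.DataFrame({
    'region': ['South', 'West', 'South'],            # nominal
    'size_class': ['small', 'large', 'medium'],      # ordinal
    'temp_c': [21.5, 18.0, 25.1],                    # interval
    'enrollment': [1200, 15000, 6400],               # ratio
})

# dtypes reports how pandas stores each column
print(sample.dtypes)

# An ordered categorical tells pandas the ranking of the categories
sample['size_class'] = pd.Categorical(
    sample['size_class'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)
largest = sample['size_class'].max()
print(largest)
```

Note that pandas dtypes describe storage, not levels of measurement: a numeric ID column is stored as an integer but is still nominal.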
Measures of central tendency
- Mode: the most frequently occurring value in a distribution
- Median: the middle value in a distribution (50 percent of observations fall above it and 50 percent below)
- Mean: the arithmetic average of a distribution
The mean of a sample (\(\overline{x}\)) is calculated as follows:
\[\overline{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\]
where \(n\) is the number of elements in the sample.
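As a small worked example, the three measures can be computed on a made-up sample with pandas. The values are hypothetical, and the manual calculation mirrors the formula for the mean.

```python
import pandas as pd

# Hypothetical sample of six values
values = pd.Series([2, 3, 3, 5, 7, 10])

# Mean from the formula: sum of the values divided by n
manual_mean = values.sum() / len(values)

mean = values.mean()
median = values.median()
mode_values = values.mode()  # a Series, since there can be ties

print(manual_mean, mean, median)
print(mode_values.tolist())
```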
Measures of dispersion
- Range: difference between maximum and minimum values in a distribution
- Interquartile range: difference between the values at the 25 percent and 75 percent points in a distribution
- Variance and standard deviation
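The range and interquartile range can be computed as in this minimal sketch. The sample values are hypothetical, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical sample of six values
values = pd.Series([2, 3, 3, 5, 7, 10])

# Range: maximum minus minimum
value_range = values.max() - values.min()

# Interquartile range: 75th percentile minus 25th percentile
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

print(value_range)
print(q1, q3, iqr)
```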
Variance
- A measure of the spread of a sample. The variance is computed as:
\[{\sigma}^2 = \dfrac{\sum\limits_{i=1}^{n}(x_i - \overline{x})^2}{n}\]
or, in simpler terms, the average of the squared deviations of the values of a sample from its mean.
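The formula can be applied step by step, as in the sketch below on a made-up sample. One detail to be aware of: by default, pandas divides by n - 1 rather than n in .var() and .std(), so passing ddof=0 is needed to match the formula shown here.

```python
import pandas as pd

# Hypothetical sample of six values
values = pd.Series([2, 3, 3, 5, 7, 10])

# Follow the formula: squared deviations from the mean, averaged over n
deviations = values - values.mean()
manual_var = (deviations ** 2).sum() / len(values)

var_n = values.var(ddof=0)    # divides by n, matching the formula
var_n_minus_1 = values.var()  # pandas default: divides by n - 1

print(manual_var, var_n, var_n_minus_1)
```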
Standard deviation
- Computed as the square root of the variance; denoted by \(\sigma\).
- Offers a standardized way to discuss the spread of a distribution. For example, in a normally distributed sample:
  - About 68 percent of the values will be within one standard deviation of the mean
  - About 95 percent of the values will be within two standard deviations of the mean
  - About 99.7 percent of the values will be within three standard deviations of the mean
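These shares can be checked by simulation. The sketch below draws a large normally distributed sample with NumPy and measures the share of values within one, two, and three standard deviations of the mean. The mean of 50, standard deviation of 10, seed, and the helper share_within are arbitrary choices for illustration.

```python
import numpy as np

# Simulated normal sample; the parameters and seed are arbitrary
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100_000)

mean = sample.mean()
sd = sample.std()

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return np.mean(np.abs(sample - mean) <= k * sd)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```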
Descriptive statistics in pandas
- Descriptive stats are available in pandas as data frame methods, e.g. grad.mean(), grad.std()
- Calling .describe() will give you back a number of important descriptive stats at once
grad.describe()
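To see what these methods return, here is a sketch on a small made-up table; the column names and values are hypothetical. The numeric_only=True argument restricts the mean to numeric columns, which matters when the table also holds text.

```python
import pandas as pd

# Hypothetical table with one text column and two numeric columns
sample = pd.DataFrame({
    'school': ['A', 'B', 'C', 'D'],
    'rate': [61.5, 74.0, 58.2, 80.3],
    'enrollment': [1200, 15000, 6400, 9800],
})

# Column means, skipping the text column
means = sample.mean(numeric_only=True)

# count, mean, std, min, quartiles, and max for each numeric column
summary = sample.describe()

print(means)
print(summary)
```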

Exploratory visualization
- Often, when exploring a dataset, you’ll want to use graphical representations of your data to help reveal insights/trends
- Visualization: Graphical representation of data
Visualization in Python
- Core visualization package in Python: matplotlib, which comes pre-installed with Anaconda
- To show matplotlib graphics in your Jupyter Notebook, type %matplotlib inline
- seaborn: extension to matplotlib to make your graphics look nicer! Seaborn is available from Anaconda but not pre-installed. To install from the command line, type conda install seaborn
- Standard import: I use import seaborn as sb, the creator uses import seaborn as sns.
Histograms
- Histogram: graphical representation of a frequency distribution
- Observations are organized into bins, and plotted with values along the x-axis and the number of observations in each bin along the y-axis
- Normal distribution: histogram is approximately symmetrical (a “bell curve”)
- Histograms are built into pandas
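The binning step behind a histogram can be seen without drawing anything. The sketch below uses np.histogram on made-up values to return the count in each bin along with the bin edges; the values and the choice of four bins are hypothetical.

```python
import numpy as np

# Hypothetical graduation rates
values = np.array([52, 55, 62, 63, 64, 68, 72, 73, 75, 88])

# Split the range of the data into 4 equal-width bins and count
# how many observations fall into each one
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

A plotted histogram draws one bar per bin, with the bar height given by these counts.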
Example histogram
%matplotlib inline
import seaborn as sb
grad.rate.hist()

Density plots
- Smooth representations of your data can be produced with kernel density plots
- Accessible from both pandas and seaborn
# In seaborn 0.11 and later, fill replaces the deprecated shade argument
sb.kdeplot(x=grad.rate, fill=True)
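To give a sense of what a kernel density estimate computes, the sketch below builds one by hand with NumPy: it places a Gaussian curve on each observation and averages the curves. This is an illustration of the general idea, not seaborn's implementation. The values and the bandwidth of 5 are hypothetical, and seaborn chooses a bandwidth automatically.

```python
import numpy as np

# Hypothetical graduation rates
values = np.array([52, 55, 62, 63, 64, 68, 72, 73, 75, 88], dtype=float)

bandwidth = 5.0  # hypothetical smoothing choice; larger means smoother
grid = np.linspace(values.min() - 4 * bandwidth,
                   values.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, evaluated at every grid point
z = (grid[:, None] - values[None, :]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = kernels.mean(axis=1)

# A density should enclose a total area of about 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```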

Box plots
- Also termed “box and whisker plots” - alternative way to show distribution of values graphically
sb.boxplot(x=grad.rate, color="green")

Anatomy of a box plot

- Dots beyond the whiskers: outliers
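The pieces of a box plot can also be computed directly. The sketch below assumes the common convention that whiskers extend to the most extreme observations within 1.5 times the interquartile range of the box, which is the default in matplotlib and seaborn; the values are hypothetical.

```python
import pandas as pd

# Hypothetical graduation rates with one unusually high value
values = pd.Series([52, 55, 62, 63, 64, 68, 72, 73, 75, 120])

# The box spans the first to third quartile; the line inside is the median
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits for the whiskers under the 1.5 x IQR convention
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low = inside.min()
whisker_high = inside.max()

# Anything beyond the limits is drawn as an individual dot
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers.tolist())
```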
Violin plots
- Combinations of box plots and kernel density plots
sb.violinplot(x=grad['rate'], color='cyan')
